Abstract

Online learning algorithms that minimize regret provide strong guarantees in 
situations that involve repeatedly making decisions in an uncertain environment, 
e.g. a driver deciding what route to drive to work every day. While regret min- 
imization has been extensively studied in repeated games, we study regret mini- 
mization for a richer class of games called bounded memory games. In each round 
of a two-player bounded memory-m game, both players simultaneously play an 
action, observe an outcome and receive a reward. The reward may depend on the 
last m outcomes as well as the actions of the players in the current round. The 
standard notion of regret for repeated games is no longer suitable because actions 
and rewards can depend on the history of play. To account for this generality, we 
introduce the notion of k-adaptive regret, which compares the reward obtained by 
playing actions prescribed by the algorithm against a hypothetical k-adaptive
adversary with the reward obtained by the best expert in hindsight against the same
adversary. Roughly, a hypothetical k-adaptive adversary adapts her strategy to the
defender's actions exactly as the real adversary would within each window of k 
rounds. Our definition is parametrized by a set of experts, which can include both 
fixed and adaptive defender strategies. 

We investigate the inherent complexity of and design algorithms for adaptive 
regret minimization in bounded memory games of perfect and imperfect information.
We prove a hardness result showing that, with imperfect information, any
k-adaptive regret minimizing algorithm (with fixed strategies as experts) must be
inefficient unless NP = RP, even when playing against an oblivious adversary.
In contrast, for bounded memory games of perfect and imperfect information we
present approximate 0-adaptive regret minimization algorithms against an oblivious
adversary running in time $n^{O(1)}$.
1 Introduction



Online learning algorithms that minimize regret provide strong guarantees in situations 
that involve repeatedly making decisions in an uncertain environment. There is a well 
developed theory for regret minimization in repeated games[?]. The goal of this paper 
is to study regret minimization for a richer class of settings. As a motivating example 
consider a store that sells perishable goods (e.g., bread, milk, eggs) and faces a series 
of different customers every k rounds. The store manager may be uncertain about the 
future demand for certain goods. Nevertheless, each round the store manager must de- 
cide how to price the goods and whether or not to restock. Another motivating example 
involves developing effective auditing strategies in an adversarial environment: Con- 
sider a hospital (defender) where a series of different employees or business affiliates 
(adversary) access patient records for legitimate purposes (e.g., treatment or payment) 
or inappropriately (e.g., out of curiosity about a family member or for financial gain). 
The hospital wants to minimize its overall loss by balancing the cost of audits with the 
risk of externally detected violations. In these settings, a reasonable strategy for the 
defender (manager) is one that minimizes her regret. 

Modeling these scenarios as a repeated game of imperfect information is challeng- 
ing because the games have two additional characteristics that are not captured by a 
repeated game model: (1) History-dependent rewards: The payoff function depends 
not only on the current outcome but also on previous outcomes. For example, a poorly 
stocked store may not benefit from a sudden surge in the demand for bread. (2) History- 
dependent actions: Both players may adapt their strategies based on history. 

Instead we capture this form of history dependence by introducing bounded-memory
games, a subclass of stochastic games.^1 In each round of a two-player
bounded-memory-m game, both players simultaneously play an action, observe an outcome and
receive a reward. In contrast to a repeated game, the payoffs may depend on the state 
of the game. In contrast to a general stochastic game, the rewards may only depend on 
the outcomes from the last m rounds (e.g., milk stocked more than m rounds ago will 
have expired) as well as the actions of the players in the current round. 

In a bounded memory game, the standard notion of regret for a repeated game is not 
suitable because the adversary may adapt her actions based on the history of play. To 
account for this generality, we introduce (in Section 3) the notion of k-adaptive regret,
which compares the reward obtained by playing actions prescribed by the algorithm
against a hypothetical k-adaptive adversary with the reward obtained by the best expert
in hindsight against the same adversary. Roughly, a hypothetical k-adaptive adversary
plays exactly the same actions as the real adversary except in the last k rounds where
she adapts her strategy to the defender's actions exactly as the real adversary would.
When k = 0, this definition coincides with the standard definition of an oblivious
adversary considered in defining regret for repeated games. When k = ∞ we get a
fully adaptive adversary. A k-adaptive adversary is a natural model for the series of

^1 Stochastic games [?] are expressive enough to model history dependence. However, there is no regret
minimization algorithm for the general class of stochastic games. While we do not view this result as 
surprising or novel, we include it in the appendix for completeness (Theorem 11). We also prove that fully 
adaptive regret minimization algorithms do not exist for bounded-memory games following the impossibility 
result for stochastic games. 
different customers in the chainstore game, different bidders in a repeated auction or
different employees in a hospital audit. Our definition is parameterized by a set of 
experts, which can include both fixed and adaptive defender strategies. 

Next, we investigate the inherent complexity of and design algorithms for adaptive 
regret minimization in bounded-memory games of perfect and imperfect information. 
Our results are summarized in Table 1. We prove a hardness result (Section 4, Theorem 1)
showing that, with imperfect information, any k-adaptive regret minimizing
algorithm (with fixed strategies as experts) must be inefficient unless NP = RP, even
when playing against an oblivious adversary and even when k = 0. In fact, the result
is even stronger and applies to any γ-approximate k-adaptive regret minimizing algorithm
(ensuring that the regret bound converges to γ rather than 0 as the number of
rounds T → ∞) for $\gamma < 1/(8n^\beta)$, where n is the number of states in the game and β > 0.
Our hardness reduction from MAX3SAT uses the state of the bounded-memory game 
and the history-dependence of rewards in a critical way. 

We present an inefficient k-adaptive regret minimizing algorithm by reducing the
bounded-memory game to a repeated game. The algorithm is inefficient for
bounded-memory games when the number of experts is exponential in the number of states of
the game (e.g., if all fixed strategies are experts). In contrast, for bounded-memory
games of perfect information, we present an efficient $n^{O(1/\gamma)}$ time γ-approximate
0-adaptive regret minimization algorithm against an oblivious adversary for any constant
γ > 0 (Section 5, Theorem 4). We also show how this algorithm can be adapted to
get an efficient γ-approximate 0-adaptive regret minimization algorithm for
bounded-memory games of imperfect information (Section 5, Theorem 5). The main novelty in
these algorithms is an implicit weight representation for an exponentially large set of 
adaptive experts, which includes all fixed strategies. 
Table 1: Regret Minimization in Bounded Memory Games

X - no regret minimization algorithm exists 

Hard - unless NP = RP no regret minimization algorithm is efficiently computable 
APX - efficient approximate regret minimization algorithms exist. 



Related Work Stochastic games were defined by Shapley [?]. Much of the work 
on stochastic games has focused on finding and computing equilibria for these games 
[?, ?]. Regret minimization in stochastic games has not been the subject of much 
research. Papadimitriou and Yannakakis showed that many natural optimization prob- 
lems relating to stochastic games are hard [?]. These results don't apply to bounded 
memory games. Golovin and Krause recently showed that a simple greedy algorithm 
can be used when a stochastic optimization problem satisfies a property called adaptive 
submodularity [?]. In general, bounded memory games do not satisfy this property.
Even-Dar, et al., show that regret minimization is possible for a class of stochastic 
games (Markov Decision Processes) in which the adversary chooses the reward func- 
tion at each state but does not influence the transitions[?]. They also prove that if the 
adversary controls the reward function and the transitions, then it is NP-Hard to even 
approximate the best fixed strategy. Mannor and Shimkin [?] show that if the adver- 
sary completely controls the transition model (a Controlled Markov Process) then it is 
possible to separate the stochastic game into a series of matrix games and efficiently 
minimize regret in each matrix game. Bounded-memory games are a different subset 
of stochastic games where the transitions and rewards are influenced by both players. 
While our hardness proof shares techniques with Even-Dar, et al., [?], there are significant
differences that arise from the bounded-memory nature of the game. We provide
a detailed comparison in Section [?].

In a recent paper, Even-Dar, et al., [?] handle a few specific global cost functions 
related to load balancing. These cost functions depend on history. In their setting, the 
adversary obliviously plays actions from a joint distribution. In contrast, we consider 
arbitrary cost functions with bounded dependence on history and adaptive adversaries. 

Takimoto and Warmuth [?] developed an efficient online shortest path algorithm. 
In their setting the experts consist of all fixed paths from the source to the destination.
Because there may be exponentially many paths their algorithm must use an implicit 
weight representation. Awerbuch and Kleinberg later provided a general framework for 
online linear optimization [?]. In our settings, an additional challenge arises because 
experts adapt to adversary actions. See Section 5 for a more detailed comparison.

There has been a lot of work on regret minimization for repeated games [?]. A closely
related work is the regret minimizing audit mechanism of Blocki, et al., [?] that uses 
a repeated game model for the audit problem. It deals with history-dependent rewards 
under certain assumptions about the payoff function, but does not consider history- 
dependent actions. Farias, et al., [?] introduce a special class of adversaries that they 
call "flexible" adversaries. A defender playing against a flexible adversary can mini- 
mize regret by learning the average expected reward of every expert. Our work differs 
from theirs in two ways. First, we work with a stochastic game as opposed to a repeated
game. Second, our algorithms can handle a sequence of different k-adaptive
adversaries instead of learning a single flexible adversary strategy. A single k-adaptive
strategy is flexible, but a sequence of k-adaptive adversaries is not.

2 Preliminaries 

Bounded-memory games are a sub-class of stochastic games, in which outcomes and 
states satisfy certain properties. Formally, a two-player stochastic game between an 
attacker A and a defender D is given by $(X_D, X_A, \Sigma, P, \tau)$, where $X_A$ and $X_D$ are
the action spaces for players A and D, respectively, $\Sigma$ is the state space,
$P : \Sigma \times X_D \times X_A \to [0, 1]$ is the payoff function and
$\tau : \Sigma \times X_D \times X_A \times \{0, 1\}^* \to \Sigma$ is
the randomized transition function linking the different states. Thus, the payoff during
round t depends on the current state (denoted $\sigma^t$) in addition to the actions of the
defender ($d^t$) and the adversary ($a^t$). We use $n = |\Sigma|$ to denote the number of states.
A bounded-memory game with memory m ($m \in \mathbb{N}$) is a stochastic game with the
following properties: (1) the game satisfies independent outcomes, and (2) the states
$\Sigma = \mathcal{O}^m$ encode the last m outcomes, i.e., $\sigma^t = (O^{t-1}, \ldots, O^{t-m})$. An outcome of
a given round of play is a signal observed by both players (called "public signal" in
games [?]). Outcomes depend probabilistically on the actions taken by the players. We
use $\mathcal{O}$ to denote the outcome space and $O^t \in \mathcal{O}$ to denote the outcome during round t.
We say that a game satisfies independent outcomes if $O^t$ is conditionally independent
of $(O^1, \ldots, O^{t-1})$ given $d^t$ and $a^t$. Notice that the defender and the adversary in a game
with independent outcomes may still select their actions based on history. However, 
once those actions have been selected, the outcome is independent of the game history. 
Note that a repeated game is a bounded-memory-0 game (a bounded-memory game 
with memory m = 0). 
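
To make the definition concrete, the following minimal Python sketch simulates a deterministic toy version of a bounded-memory-m game. The class name `BoundedMemoryGame`, the callbacks `outcome_fn` and `payoff_fn`, and the all-zero initial state are illustrative assumptions; in the formal model above outcomes are drawn probabilistically, whereas here they are a deterministic function of the current actions.

```python
from collections import deque


class BoundedMemoryGame:
    """Toy deterministic bounded-memory-m game (hypothetical helper for illustration)."""

    def __init__(self, m, outcome_fn, payoff_fn, initial_outcome=0):
        self.m = m
        # Independent outcomes: the outcome depends only on the current actions (d, a).
        self.outcome_fn = outcome_fn
        # The payoff may depend on the state (last m outcomes) and the current actions.
        self.payoff_fn = payoff_fn
        # The state stores the last m outcomes, most recent first.
        self._window = deque([initial_outcome] * m, maxlen=m)

    @property
    def state(self):
        return tuple(self._window)

    def step(self, d, a):
        """Play one round: compute the payoff, then shift the new outcome into the state."""
        reward = self.payoff_fn(self.state, d, a)
        self._window.appendleft(self.outcome_fn(d, a))
        return reward
```

With m = 0 the state is always the empty tuple, so the sketch degenerates to a repeated game, matching the remark above.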

A game in which players only observe the outcome $O^t$ after round t but not the
actions taken during a round is called an imperfect information game. If both players 
also observe the actions then the game is a perfect information game. 

The history of a game, $H = (O^1, O^2, \ldots, O^i, \ldots, O^t)$, is the sequence of outcomes.
We use $H_k$ to denote the k most recent outcomes in the game (i.e.,
$H_k = (O^{t-k+1}, \ldots, O^t)$), and $t = |H|$ to denote the total number of rounds played. We use
$H^i$ to denote the first i outcomes in a history (i.e., $H^i = (O^1, \ldots, O^i)$), and $H; H'$ to
denote concatenation of histories H and H'.

A fixed strategy for the defender in a stochastic game is a function $f : \Sigma \to X_D$
mapping each state to a fixed action. F denotes the set of all fixed strategies.



3 Definition of Regret 

As discussed earlier, regret minimization in repeated games has received a lot of at- 
tention [?]. Unfortunately, the standard definition of regret in repeated games does not 
directly apply to stochastic games. In a repeated game, regret is computed by comparing
the performance of the defender strategy D with the performance of a fixed
strategy f. However, in a stochastic game, the actions of the defender and the adversary
in round i influence payoffs in each round for the rest of the game. Thus, it
is unclear how to choose a meaningful fixed strategy f as a reference. We solve this
conundrum by introducing an adversary-based definition of regret. 

3.1 Adversary Model 

We define a parameterized class of adversaries called k-adaptive adversaries, where the
parameter k denotes the level of adaptiveness of the adversary. Formally, we say that
an agent is k-adaptive if its strategy A(H) is defined by a function $f : \mathcal{O}^* \times \mathbb{N} \to X_A$
such that $A(H) = f(H_i, t)$, where $i = t \bmod (k + 1)$. Recall that $H_i$ is the i most
recent outcomes, and $t = |H|$.
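
The definition can be read as a wrapper that hides all but the i most recent outcomes from an arbitrary decision rule. The sketch below illustrates this in Python; the function name `make_k_adaptive` and the representation of a history as a tuple of outcomes are assumptions made for the example.

```python
def make_k_adaptive(f, k):
    """Return a strategy A with A(H) = f(H_i, t), where t = |H| and i = t mod (k + 1).

    f is any rule taking (window, t); window holds the i most recent outcomes.
    """
    def strategy(history):
        t = len(history)
        i = t % (k + 1)
        window = tuple(history[t - i:])  # empty tuple when i == 0
        return f(window, t)

    return strategy
```

For k = 0 the window is always empty, so the wrapped strategy can depend only on the round number t, which is the oblivious case discussed next.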

As special cases we define an oblivious adversary (k = 0) and a fully adaptive adversary
(k = ∞). Oblivious adversaries essentially play without any memory of the
previous outcomes. Fully adaptive adversaries, on the other hand, choose their actions
based on the entire outcome history since the start of the game. k-adaptive adversaries
lie somewhere in between. At the start of the game, they act as fully adaptive adversaries,
playing with the entire outcome history in mind. But, unlike fully
adaptive adversaries, every k rounds they "forget" the entire history of the game
and act as if the whole game were starting afresh. As discussed earlier, there are numerous
practical instances where k-adaptive adversaries are an appropriate model; for
instance, in games in which one player (e.g., a firm) has a much longer length of play
than the adversary (e.g., a temporary employee), it may be judicious to model the adversary
as k-adaptive. In particular, k-adaptive adversaries are similar to the notion
of "patient" players in long-run games discussed by [?]. Their "fully patient"
players correspond to fully adaptive adversaries, "myopic" players correspond
to oblivious adversaries, and "not myopic but less patient" players correspond to
k-adaptive adversaries.

Another possible adversary definition could be to consider a sliding window of size
k as the adversary memory. But, because such an adversary can play actions to remind
herself of events in the arbitrary past, her memory is not actually bounded by k, and
regret minimization is not possible. See Section 8.3 in the appendix for details.

and denote all possible K-adaptive strategies for the defender and adversary,
respectively.



3.2 k-Adaptive Regret

Suppose that the defender D and the adversary A have produced history H in a game G
lasting T rounds. Let $a^1, \ldots, a^T$ denote the sequence of actions played by the adversary.
In hindsight we can construct a hypothetical k-adaptive adversary $A_k$ as follows:

$$A_k(H') = A(H^{t-i}; H'_i),$$

where $t = |H'|$ and $i = t \bmod (k + 1)$. In other words, the hypothetical k-adaptive
adversary replicates the plays the real adversary made in the actual game regardless
of the strategy of the defender he is playing against, except for the last i rounds under
consideration where he adapts his strategy to the defender's actions in the same manner
the real adversary would.
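
A small Python sketch of this construction follows. It assumes that strategies are callables on tuples of outcomes and that the counterfactual history H' is no longer than the real history H; the function name `hypothetical_adversary` is hypothetical and chosen for this example.

```python
def hypothetical_adversary(real_adversary, real_history, k):
    """Build A_k with A_k(H') = A(H^{t-i}; H'_i), t = |H'|, i = t mod (k + 1).

    real_adversary: callable mapping a tuple of outcomes to an action.
    real_history:   outcomes actually produced against the real adversary.
    """
    def a_k(h_prime):
        t = len(h_prime)
        i = t % (k + 1)
        replayed = tuple(real_history[: t - i])  # H^{t-i}: prefix of the real history
        recent = tuple(h_prime[t - i:])          # H'_i: last i outcomes of H'
        return real_adversary(replayed + recent)

    return a_k
```

With k = 0 the counterfactual history is ignored entirely, so the hypothetical adversary replays the real adversary's responses to the real history.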

Abusing notation slightly we write $P(f, A, G, \sigma_0, T)$ to denote the expected payoff
the defender would receive over T rounds of G given that the defender plays strategy
f, the adversary uses strategy A and the initial state of the bounded-memory game G
is $\sigma_0$. We use $P(f, A, G, T) = P(f, A, G, \sigma_0, T)/T$ to denote the average per-round
payoff. We use

$$R_k(D, A, G, T, S) = \max_{f \in S} P(f, A_k, G, T) - P(D, A_k, G, T)$$

to denote the k-adaptive regret of the defender strategy D using a fixed set S of experts
against an adversary strategy A for T rounds of the game G.
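
As an illustration of the definition, the sketch below computes the regret in a deterministic toy setting. It assumes that every defender strategy (including D) is a function of the current state only, that outcomes are deterministic, and that the initial state is all zeros; the helper names `average_payoff` and `k_adaptive_regret` are introduced here for the example and the adversary passed in plays the role of the hypothetical adversary $A_k$.

```python
def average_payoff(defender, adversary, outcome_fn, payoff_fn, m, T):
    """Average per-round payoff over T rounds of a deterministic bounded-memory-m game."""
    state = (0,) * m          # assumed initial state sigma_0
    history = []
    total = 0.0
    for _ in range(T):
        d = defender(state)
        a = adversary(tuple(history))
        total += payoff_fn(state, d, a)
        outcome = outcome_fn(d, a)
        history.append(outcome)
        state = ((outcome,) + state)[:m]  # keep only the last m outcomes
    return total / T


def k_adaptive_regret(defender, experts, adversary_k, outcome_fn, payoff_fn, m, T):
    """Best expert's average payoff minus the defender's, both against adversary_k."""
    def score(strategy):
        return average_payoff(strategy, adversary_k, outcome_fn, payoff_fn, m, T)

    return max(score(f) for f in experts) - score(defender)
```

For example, against an adversary that alternates 0, 1, 0, 1 with outcome equal to the adversary's action and payoff 1 for a matching action, the constant strategy 0 earns 0.5 per round over four rounds while the strategy that plays the opposite of the last outcome earns 0.75, giving regret 0.25.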

Definition 1. A defender strategy D using a fixed set S of experts is a γ-approximate
k-adaptive regret minimization algorithm for the class of games $\mathcal{G}$ if and only if for
every adversary strategy A, every $\epsilon > 0$ and every game $G \in \mathcal{G}$ there exists $T' > 0$
such that for all $T > T'$

$$R_k(D, A, G, T, S) < \epsilon + \gamma.$$

If γ = 0 then we simply refer to D as a k-adaptive regret minimization algorithm. If D
runs in time poly(n, 1/ε) we call D efficient.

k-adaptive regret considers a k-adaptive hypothetical adversary who can adapt
within each window of size (at most) k + 1. Intuitively, as k increases this measure
of regret is more meaningful (as the hypothetical adversary increasingly resembles the 
real adversary), albeit harder to minimize. 

There are two important special cases to consider: k = 0 (oblivious regret) and
k = ∞ (adaptive regret). Adaptive regret is the strongest measure of regret. Observe
that if the actual adversary is k-adaptive then the hypothetical adversary $A_\infty$ is the same
as the hypothetical adversary $A_k$, and hence $R_\infty = R_k$. Also, if the actual adversary
is oblivious then $R_\infty = R_0 = R_k$.

In this paper $\mathcal{G}$ will typically denote the class of perfect/imperfect information
bounded-memory games with memory m. We are interested in expert sets S which
contain all of the fixed strategies, $F \subseteq S$.

4 Hardness Results 

In this section, we show that unless NP = RP no oblivious regret minimization algorithm
which uses the fixed strategies F as experts can be efficient in the imperfect
information setting. In the appendix (Remark 1) we explain how our hardness reduction
can be adapted to prove that there is no efficient k-adaptive regret minimization
algorithm in the perfect information setting for k > 1.

Specifically, we consider the subclass of bounded-memory games $\mathcal{G}$ with the following
properties: $|\mathcal{O}| = O(1)$, $m = O(\log n)$, $|X_A| = O(1)$, $|X_D| = O(1)$ and
imperfect information. Any $G \in \mathcal{G}$ is a game of imperfect information (on round t the
defender observes $O^t$, but not $a^t$) with $O(n)$ states. Our goal is to prove the following
theorem:

Theorem 1. For any $\beta > 0$ and $\gamma < 1/(8n^\beta)$ there is no efficient γ-approximate oblivious
regret minimization algorithm which uses the fixed strategies F as experts against
oblivious adversaries for the class of imperfect information bounded-memory-m games 
unless NP = RP. 

Given a slightly stronger complexity-theoretic assumption called the randomized 
exponential time hypothesis [?] we can prove a slightly stronger hardness result. The 
randomized exponential time hypothesis says that no randomized algorithm running in
time $2^{o(n)}$ can solve SAT.

Theorem 2. Assume that the randomized exponential time hypothesis is true. Then for
any $\gamma < 1/(8 \log^2 n)$ there is no efficient γ-approximate oblivious regret minimization
algorithm which uses the fixed strategies F as experts against oblivious adversaries for 
the class of imperfect information bounded-memory-m games. 

The proofs of Theorems 1 and 2 use the fact that it is hard to approximate MAX3SAT
within any factor better than 7/8 [?]. This means that unless NP = RP, for every
constant $\beta > 0$ and every randomized algorithm S in RP, there exists a MAX3SAT
instance $\phi$ such that the expected fraction of clauses in $\phi$ unsatisfied by $S(\phi)$ is $\geq 1/8 - \beta$
even though there exists an assignment satisfying a $(1 - \beta)$ fraction of the clauses in $\phi$.

We reduce a MAX3SAT formula $\phi$ with variables $x_1, \ldots, x_n$ and clauses $C_1, \ldots, C_\ell$
to a bounded-memory game G described formally below. We provide a high level
overview of the game G before describing the details. The main idea is to construct G
so that the rewards in G are related to the fraction of clauses of $\phi$ that are satisfied.
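
The quantity that the rewards track, the fraction of satisfied clauses, can be computed as in the short Python sketch below. The clause encoding (a tuple of nonzero integers, where v stands for the variable with index v and -v for its negation) and the function name `satisfied_fraction` are assumptions made for this example.

```python
def satisfied_fraction(clauses, assignment):
    """Fraction of clauses satisfied by a truth assignment.

    clauses:    list of tuples of nonzero ints; literal v means x_v, -v means NOT x_v.
    assignment: dict mapping each variable index to True or False.
    """
    satisfied = sum(
        any((literal > 0) == assignment[abs(literal)] for literal in clause)
        for clause in clauses
    )
    return satisfied / len(clauses)
```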

In G, for each variable x there is a state associated with that variable. The
oblivious adversary controls the transitions between variables. This allows the oblivious
adversary $A_R$ to partition the game into stages of length n, such that during each
stage the adversary causes the game to visit each variable exactly once (each state is
associated with a variable). During each stage the adversary picks a clause C at random.
In G we have $0, 1 \in X_D$. Intuitively, the defender chooses assignment x = 1
by playing the action 1 while visiting the variable x. The defender receives a reward if
and only if he succeeds in satisfying the clause C.

The game G is defined as follows: 
Defender Actions: $X_D = \{0, 1, 2\}$
Adversary Actions: $X_A = \{0, 1\} \times \{0, 1, 2, 3\}$

Outcomes and States: Each round t produces two outcomes:

$$O^t = a^t[1] \quad \text{and} \quad \tilde{O}^t = \begin{cases} 1 & \text{if } d^t = 2 \text{ or } d^t = a^t[2]; \\ 0 & \text{otherwise.} \end{cases}$$

Observe that these outcomes satisfy the independent outcomes requirement for bounded-memory games.
There are $n = 2^{m+1}$ states, where

$$\sigma^t = \left( (O^{t-1}, \ldots, O^{t-m}), \tilde{O}^{t-1} \right)$$

is the state at round t. Observe that each state encodes the last m outcomes O and the
last outcome $\tilde{O}$. Intuitively, the last m outcomes O are used to denote the variable $x_i$,
while $\tilde{O}$ is 1 if the defender has already received a reward during the current phase.

The defender actions 0, 1 correspond to the truth assignments 0, 1. The defender
receives a reward for the correct assignment. The defender is punished if he attempts to
obtain a reward in any phase after he has already received a reward in that phase. Once the
defender has already received a reward he can play the special action 2 to avoid getting punished.
The intuitive meaning of the adversary's actions is explained below.

If we ignore the outcome $\tilde{O}$ then the states form a De Bruijn graph [?] where each
node corresponds to a variable of $\phi$. Notice that the adversary completely controls the
outcomes O with the first component of his action a[1]. By playing a De Bruijn sequence
$S = s_1 \ldots s_n$ the adversary can guarantee that we repeatedly take a Hamiltonian
cycle over states (for an example see Figure 2 in the appendix).
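
The Hamiltonian-cycle property can be checked with a short Python sketch. It generates a binary De Bruijn sequence of order m with the standard Lyndon-word (FKM) construction and lists the length-m windows visited when the sequence is played cyclically. The function names are hypothetical, and the sketch ignores the extra outcome bit, so it covers only the $2^m$ window states.

```python
def de_bruijn_binary(m):
    """Binary De Bruijn sequence of order m (standard FKM / Lyndon-word construction)."""
    a = [0] * (2 * m)
    sequence = []

    def db(t, p):
        if t > m:
            if m % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, 2):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return sequence


def visited_windows(sequence, m):
    """Length-m windows seen when the sequence is played cyclically, one per round."""
    n = len(sequence)
    return [tuple(sequence[(i + j) % n] for j in range(m)) for i in range(n)]
```

Every one of the $2^m$ possible windows appears exactly once per pass through the sequence, which is why an adversary who plays the sequence as the first component of his action forces the game through each window state once per stage.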

Rewards:^2

$$P(\sigma^t, d^t, a^t) = \begin{cases} -1 & \text{if } \tilde{O}^{t-1} = 1 \text{ and } d^t \neq 2 \text{ and } a^t[2] \neq 3; \\ 1 & \text{if } d^t \neq 2 \text{ and } d^t = a^t[2] \text{ and } \tilde{O}^{t-1} = 0; \\ 0 & \text{otherwise.} \end{cases}$$

An intuitive interpretation of the reward function is presented in parallel with the adversary
strategy.

^2 We use payoffs in the range [-1, 1] for ease of presentation. These payoffs can easily be re-scaled to lie
in [0, 1].
Adversary Strategy: The first component of the adversary's action (a[1]) controls the
transitions between variables. The adversary will play the action $a^t[2] = 1$ (resp.
$a^t[2] = 0$) whenever the corresponding variable assignment $x_i = 1$ (resp. $x_i = 0$)
satisfies the clause that the adversary chose for the current phase.

If neither variable assignment satisfies the clause (if $x_i \notin C$ and $\bar{x}_i \notin C$)
then the adversary plays $a^t[2] = 2$. This ensures that a defender can only be rewarded
during a round if he satisfies the clause C, which happens when $d^t = a^t[2] = 0$ or 1.

Notice that whenever $\tilde{O} = 1$ there is no way to receive a positive reward.
The defender may want the game G to return to a state where $\tilde{O} = 0$, but unless
the adversary plays the special action $a^t[2] = 3$ he is penalized when this
happens. The adversary action $a^t[2] = 3$ is a special 'reset phase' action. By playing
$a^t[2] = 3$ once at the end of each phase the adversary can ensure that the
maximum payoff the defender receives during any phase is 1. See Figure 1 for a
formal description of the adversary strategy.

• Input: Random string $R \in \{0, 1\}^*$
• Input: MAX3SAT instance $\phi$, with variables $x_1, \ldots, x_{n-1}$, and clauses $C_1, \ldots, C_\ell$.
• De Bruijn sequence: $s_0, \ldots, s_{n-1}$
• Round t: Set $i = t \bmod n$.
  1. Select Clause: If i = 0 then select a clause C uniformly at random from $C_1, \ldots, C_\ell$ using R.
  2. Select Move:

Figure 1: Oblivious Adversary $A_R$
Analysis: At a high level, our hardness argument proceeds as follows: 1. If there is an
assignment that satisfies a $(1 - \beta)$ fraction of the clauses in $\phi$, then there is a fixed strategy
that performs well in expectation (see Claim 2). 2. If there is a fixed strategy that performs
well in expectation, then any γ-approximate oblivious regret minimization algorithm
will perform well in expectation (see Claim 3). 3. If an efficiently computable strategy
D performs well in expectation, then there is an efficiently computable randomized
algorithm S to approximate MAX3SAT. This would imply that NP = RP. The proofs
of Theorem 1 and Theorem 2 can be found in the appendix.

Our hardness reduction is similar to a result from Even-Dar, et al., [?]. They con- 
sider regret minimization in a Markov Decision Process where the adversary controls 
the transition model. Their game is not a bounded-memory game; in particular it does 
not satisfy our independent outcomes condition. The current state in their game can
depend on the last n actions. In contrast, we consider bounded-memory games with 
m = O (log n), so that the current state only depends on the last m actions. This makes 
it much more challenging to enforce guarantees such as "the defender can only receive
a reward once in each window of n rounds" — a property that is used in the hardness 
proof. The adversary is oblivious so she will not remember this fact, and the game 
itself cannot record whether a reward was given m + 1 rounds ago. We circumvented 
this problem by designing a payoff function in which the defender is penalized for al- 
lowing the game to "forget" when the last reward was given, thus effectively enforcing 
the desired property. 



5 Regret Minimization Algorithms 



In Section 5.1 we present a reduction from bounded-memory games to repeated games.
This reduction can be used to create a k-adaptive regret minimizing algorithm (see
Section 8.1 in the appendix). This is significant because there is no k-adaptive regret
minimization algorithm for the general class of stochastic games. A consequence of
Theorem 1 is that when the expert set includes all fixed strategies F we cannot hope for
an efficient algorithm unless NP = RP. In Section 5.2 we present an efficient approximate
0-adaptive regret minimization algorithm for bounded-memory games of perfect
information. The algorithm uses an implicit weight representation to efficiently sample
the experts and update their weights. Finally, we show how this algorithm can be
adapted to obtain an efficient approximate 0-adaptive regret minimization algorithm
for bounded-memory games of imperfect information.

5.1 Reduction to Repeated Games 

All of our regret minimization algorithms work by first reducing the bounded-memory
game G to a repeated game $\rho(G, K)$. One round of the repeated game $\rho(G, K)$ corresponds
to K rounds of G. Before each round of $\rho(G, K)$ both players commit
to an adaptive strategy. In $\rho(G, K)$ the reward that the defender gets for playing a
K-adaptive strategy f is the reward that the defender would have received for using the
strategy f for the next K rounds of the actual game G if the initial state were $\sigma_0$:

$$P(f, g, \rho(G, K)) = P(f, g, G, \sigma_0, K).$$

The rewards in $\rho(G, K)$ may be different from the actual rewards in G because the
initial state before each K rounds might not be $\sigma_0$. In the appendix we show that this
difference is small (see Claim 4).

The key idea behind our k-adaptive regret minimization algorithm BW is to reduce
the original bounded-memory game to a repeated game $\rho(G, K)$ of imperfect information
($K \equiv 0 \bmod k$). In particular we obtain the regret bound in Theorem 3. Details
and proofs can be found in the appendix.

Theorem 3. Let G be any bounded-memory-m game with n states and let A be any 
adversary strategy. After playing T rounds of G against A, BW(G, K) achieves regret
bound 

R,iB^N,A,G,T,S) < ;^+4^^, 
where N = |S| is the number of experts, A is the adversary strategy and K has been
chosen so that $K = T^{1/4}$ and $K \equiv 0 \bmod k$.

Intuitively, the $m/T^{1/4} = m/K$ term is due to modeling loss from Claim 4 and
the other term comes from the standard regret bound of [?].
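
To illustrate the second term, the sketch below runs a generic multiplicative-weights (Hedge) learner over an explicit expert list, where each "round" stands for one block of K rounds of G, i.e., one round of the repeated game. This is a textbook full-information variant shown only for intuition; it is not the paper's BW algorithm, which operates under imperfect information, and the function name, the learning rate `eta`, and the reward-table input are assumptions of the example.

```python
import math
import random


def hedge_over_blocks(block_rewards, eta, seed=0):
    """Generic Hedge over explicit experts, one update per block of K game rounds.

    block_rewards[j][e] is the (normalized, in [0, 1]) reward expert e would have
    earned in block j. Returns the sampled expert per block and the final weights.
    """
    rng = random.Random(seed)
    n_experts = len(block_rewards[0])
    weights = [1.0] * n_experts
    chosen = []
    for rewards in block_rewards:
        # Sample an expert with probability proportional to its current weight.
        expert = rng.choices(range(n_experts), weights=weights)[0]
        chosen.append(expert)
        # Multiplicative update: experts with higher block reward gain weight.
        weights = [w * math.exp(eta * r) for w, r in zip(weights, rewards)]
    return chosen, weights
```

Maintaining one weight per expert, as done here, is exactly what becomes infeasible when the expert set is exponentially large, which motivates the implicit representation of Section 5.2.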

5.2 Efficient Approximate Regret Minimization Algorithms 

In this section we present EXBW (Efficient approximate Bounded Memory Weighted 
Majority), an efficient algorithm to approximately minimize regret against an oblivious 
adversary in bounded-memory games with perfect information. The set of experts 
$\mathcal{E}$ used by our algorithms contains the fixed strategies F as well as all K-adaptive
strategies ($K = m/\gamma$). We prove the following theorem.

Theorem 4. Let G be any bounded-memory-m game of perfect information with n 
states and let A be any adversary strategy. Playing T rounds of G against A, EXBW 
runs in total time $T n^{O(1/\gamma)}$ and achieves regret bound

( m /=^nlog(7V)\ 
j?o(EXBW,A,G,r,g)<7 + ^^' \ , 

where K has been set to m/^ and N — \A^\ — is the number of K- 

adaptive strategies. 

In particular, for any constant γ there is an efficient γ-approximate 0-adaptive regret
minimization algorithm for bounded-memory games of perfect information. We
can adapt this algorithm to get EXBWII (Efficient approximate Bounded Memory
Weighted Majority for Imperfect Information Games), an efficient approximate
0-adaptive regret minimization algorithm for games of imperfect information using a
sampling strategy described in the proof of Theorem 5.

Theorem 5. Let G be any bounded-memory-m game of imperfect information with n 
states and let A be any adversary strategy. There is an algorithm EXBWII that runs in
total time $T n^{O(1/\gamma)}$ playing T rounds of G against A, and achieves regret bound

i?o(EXBWII,AG,r,g)<27 + T ' 

where K has been set to $m/\gamma$ and $N = |A_K|$ is the number of K-adaptive strategies.

The regret bound of Theorem 4 is simply the regret bound achieved by the standard weighted majority algorithm [?] plus the modeling loss term from Claim 4. The main challenge is to provide an efficient simulation of the weighted majority algorithm. There are an exponential number of experts so no efficient algorithm can explicitly maintain weights for each of these experts. To simulate the weighted majority algorithm EXBW implicitly maintains the weight of each expert.

To simulate the weighted majority algorithm we must be able to efficiently sample from our weighted set of experts (see Sample$(\mathcal{E})$) and efficiently update the weights of each expert in the set after each round of $\rho(G, K)$ (see the update weights stage of EXBW).

Meet the Experts. Instead of using F as the set of experts, EXBW uses a larger set of experts $\mathcal{E}$ ($F \subset \mathcal{E}$). Recall that a K-adaptive strategy is a function f mapping the K most recent outcomes $H_K$ to actions. We use a set of K-adaptive strategies $E = \{f_\sigma : \sigma \in \Sigma\} \subset A_K$ to define an expert E in $\rho(G, K)$: if the current state of the real bounded-memory game G is $\sigma$ then E uses the K-adaptive strategy $f_\sigma$ in the next round of $\rho(G, K)$ (i.e., the next K rounds of G). $\mathcal{E}$ denotes the set of all such experts.

Maintaining Weights for Experts Implicitly. To implicitly maintain the weights of each expert $E \in \mathcal{E}$ we use the concept of a game trace. We say that a game trace $p = \sigma, d^1, O^1, \ldots, O^{i-1}, d^i$ is consistent with an expert E if $f_\sigma(O^1, \ldots, O^{j-1}) = d^j$ for each j. We define the set $C(E)$ to be the set of all such consistent traces of maximum length K, and $C = \bigcup_{E \in \mathcal{E}} C(E)$ denotes the set of all traces consistent with some expert $E \in \mathcal{E}$. EXBW maintains a weight $w_p$ on each trace $p \in C$. The weight of an expert E is then defined to be $W_E = \prod_{p \in C(E)} w_p$.

Given adversary actions $\vec{a} = a^1, \ldots, a^K$ and a trace $p = \sigma, d^1, O^1, \ldots, d^{i-1}, O^{i-1}, d^i$, we define

$$\mathcal{R}(\vec{a}, \sigma', p) = \begin{cases} 0 & \text{if } \sigma \neq \sigma', \\ \prod_{j < i} \Pr[O^j \mid a^j, d^j] & \text{otherwise.} \end{cases}$$

Intuitively, $\mathcal{R}(\vec{a}, \sigma', p)$ is the probability that each outcome of p would have occurred given the adversary actions were $\vec{a}$ and the initial state was $\sigma'$. We use $\ell(p, \vec{a}, \sigma')$ to denote the payment that the defender received for playing $d^i$ (the last action in p). Formally $\ell(p, \vec{a}, \sigma') = P(\sigma^i, d^i, a^i)\, \mathcal{R}(\vec{a}, \sigma', p)$, where $\sigma^i$ denotes the state reached following the trace p (after observing outcomes $O^1, \ldots, O^{i-1}$ starting from $\sigma_0$) and $d^i$ is the final defender action in the trace. Notice that in the imperfect information setting the defender could not compute $\ell$ because he would not observe the adversary's actions $\vec{a}$.

Updating Weights Efficiently. While updating weights EXBW maintains the invariant that $w_p = \beta^{\sum_j \ell(p, \vec{a}^j, \sigma^j)}$, where $\sigma^j$ is the state of G after $jK$ rounds and $\vec{a}^j$ is the actions the adversary played during the j'th round of $\rho(G, K)$. The standard weighted majority algorithm maintains the invariant that $W_E = \beta^{\sum_j P(E, \vec{a}^j, \rho(G, K))}$. In both sums j ranges over the rounds of $\rho(G, K)$ played so far. In the appendix we show that EXBW also maintains this invariant (see Claim 5).
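The relationship between the per-trace weights and the expert weights can be checked numerically on a toy instance. The following is a minimal sketch, not the paper's implementation: the experts, trace labels and random per-trace losses are all hypothetical stand-ins, and an expert's loss in a round is assumed to be the sum of the losses of the traces it is consistent with.

```python
import math
import random

BETA = 0.5  # learning parameter in (0, 1); the value is illustrative

# Hypothetical toy instance: each expert is identified with the list of
# traces it is consistent with (the set C(E) in the text).
experts = {
    "E1": ["p1", "p2", "p4"],
    "E2": ["p1", "p3", "p4"],
    "E3": ["p2", "p3"],
}
traces = sorted({p for ps in experts.values() for p in ps})


def run(rounds: int, seed: int = 0):
    """Update per-trace weights and compare with explicit expert weights."""
    rng = random.Random(seed)
    w = {p: 1.0 for p in traces}            # implicit representation
    total_loss = {e: 0.0 for e in experts}  # what explicit weighted majority tracks
    for _ in range(rounds):
        # Stand-in for l(p, a, sigma): one loss value per trace this round.
        loss = {p: rng.random() for p in traces}
        for p in traces:
            w[p] *= BETA ** loss[p]
        for e, ps in experts.items():
            total_loss[e] += sum(loss[p] for p in ps)
    implicit = {e: math.prod(w[p] for p in ps) for e, ps in experts.items()}
    explicit = {e: BETA ** total_loss[e] for e in experts}
    return implicit, explicit
```

The two dictionaries returned by `run` agree up to floating-point error, which is the content of the invariant: only one weight per trace has to be stored, never one per expert.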

Sampling Experts Efficiently. We can also efficiently sample from $\mathcal{E}$ using dynamic programming (see Sample$(\mathcal{E})$). Using the notation $p \sqsubseteq p'$ for $p'$ extends $p$ we can define

$$\hat{w}_p = \sum_{E : p \in C(E)} \; \prod_{p' \in C(E) \wedge p \sqsubseteq p'} w_{p'}.$$

Intuitively, $\hat{w}_{p;O;d}$ represents the weight of the action d from history $p; O$. Using dynamic programming we can efficiently compute $\hat{w}_p$ for each trace p because there are only $n^{O(1/\gamma)}$ such traces. Using the weights $\hat{w}_p$ we can efficiently sample from $\mathcal{E}$. We use $p; O; d$ to denote a new game trace which contains all of the outcomes/actions in p appended with O and d.



Algorithm: EXBW$(\gamma, G)$

• Initialize: $K \leftarrow m/\gamma$

• Construct: $\rho(G, K)$

• Each Round:

  1. $\sigma \leftarrow$ G.CurrentState

  2. $E \leftarrow$ Sample$(\mathcal{E})$

  3. Play E

  4. Observe adversary actions $\vec{a} = a^1, \ldots, a^K$.

  5. Update Weights: For each $p \in C$
     Compute $\ell(p, \vec{a}, \sigma)$
     Set $w_p \leftarrow w_p \times \beta^{\ell(p, \vec{a}, \sigma)}$

Algorithm: Sample$(\mathcal{E})$

• For each trace $p \in C$ recursively compute $\hat{w}_p$ using the formula:

  $\hat{w}_p = w_p \prod_{O \in \mathcal{O}} \sum_{d \in X_D} \hat{w}_{p;O;d}$

• Build Strategy E: For each $p \in C$ and $O \in \mathcal{O}$, randomly select $d \in X_D$ with

  $\Pr[d \mid p, O] = \hat{w}_{p;O;d} \Big/ \sum_{d' \in X_D} \hat{w}_{p;O;d'}$.

• E plays d any time it observes history $p; O$.

In the appendix we prove that Sample$(\mathcal{E})$ outputs each expert E with probability proportional to $W_E$ (see Claim 6). Given Sample$(\mathcal{E})$ it is straightforward to simulate the standard weighted majority algorithm. To update weights EXBW simply loops through all traces $p \in C$ applying the update rule $w_p = w_p \times \beta^{\ell(p, \vec{a}, \sigma)}$, where $\beta$ is a learning parameter we tune later. The full proof of Theorem 4 can be found in the appendix.
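The sampling step can be sketched on a tiny instance and checked against brute-force enumeration. Everything concrete below is an assumption made for illustration: a single fixed start state, two outcomes, two defender actions, traces with at most K = 2 defender actions, and an arbitrary positive weight function `w`. The recursion used for `w_hat` is the one from Sample above, read as "weight of p times, for every outcome, the total weight of the continuations".

```python
import math
import random
from itertools import product

OUTCOMES = (0, 1)   # toy outcome set
ACTIONS = (0, 1)    # toy defender action set
K = 2               # number of defender actions in a maximal trace
BETA = 0.5


def w(p):
    """Hypothetical positive weight of trace p (alternating tuple d, O, d, ...)."""
    return BETA ** ((sum((i + 1) * s for i, s in enumerate(p)) % 5) / 4)


def is_maximal(p):
    return (len(p) + 1) // 2 == K


def w_hat(p):
    """w_hat(p) = w(p) * prod over outcomes O of sum over actions d of w_hat(p; O; d)."""
    if is_maximal(p):
        return w(p)
    return w(p) * math.prod(
        sum(w_hat(p + (o, d)) for d in ACTIONS) for o in OUTCOMES
    )


def sample_expert(rng):
    """Draw a strategy top-down; returns the sorted tuple of its consistent traces."""
    chosen = []
    frontier = [()]  # histories awaiting an action: () or p + (O,)
    while frontier:
        h = frontier.pop()
        cands = [h + (d,) for d in ACTIONS]
        p = rng.choices(cands, weights=[w_hat(c) for c in cands])[0]
        chosen.append(p)
        if not is_maximal(p):
            frontier.extend(p + (o,) for o in OUTCOMES)
    return tuple(sorted(chosen))


def dp_probability(expert):
    """Probability that sample_expert returns the given expert."""
    prob = 1.0
    for p in expert:
        prob *= w_hat(p) / sum(w_hat(p[:-1] + (d,)) for d in ACTIONS)
    return prob


def subtrees(p):
    """Brute force: every sub-strategy rooted at trace p, as a tuple of traces."""
    if is_maximal(p):
        return [(p,)]
    per_outcome = [
        [t for d in ACTIONS for t in subtrees(p + (o, d))] for o in OUTCOMES
    ]
    return [(p,) + sum(choice, ()) for choice in product(*per_outcome)]


EXPERTS = [tuple(sorted(t)) for d in ACTIONS for t in subtrees((d,))]
```

On this instance the probability computed by `dp_probability` matches the product of trace weights of each expert divided by the sum of those products over all experts, without the sampler ever enumerating the experts.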

At a high level our algorithm is similar to the online shortest path algorithm devel- 
oped by Takimoto and Warmuth [?]. In their work, they consider the set of all source- 
destination paths in a graph as experts. Since there are exponentially many paths they 
also maintain the weights of the experts implicitly. In their setting, the defender com- 
pletely controls the chosen path. In contrast, our experts adapt to adversary actions. 
The challenge was constructing a new implicit weight representation which works for K-adaptive strategies.

Using this implicit weight representation we could have also used the general barycentric spanner approach to online linear optimization developed by Awerbuch and Kleinberg [?] to design a $\gamma$-approximate 0-adaptive regret minimization algorithm running in time $n^{O(1/\gamma)}$. However, we are able to achieve better regret bounds in Theorem 4 by simulating the weighted majority algorithm. Awerbuch and Kleinberg [?, Theorem 2.8] achieve the average regret bound $O(M d^{5/3}/T^{1/3})$, where d is the dimension of the problem space and M is a bound on the cost vectors. By comparison our regret bounds in Theorems 4 and 5 tend to 0 with $1/\sqrt{T}$. In our setting, the dimension of the problem space is $d = O(n^{1/\gamma})$ (the number of nodes in the decision tree), and $M = K = m/\gamma$ is the upper bound on the cost vector in each round of $\rho(G, K)$. The average regret bound would be $O\left(\frac{m}{\gamma}\, n^{5/(3\gamma)} / T^{1/3}\right)$, i.e., the regret bound is proportional to $\sqrt[3]{n^{5/\gamma}/T}$. By comparison Theorem 4 has a $\sqrt{n^{1/\gamma}}$ in the numerator.

The standard regret minimization trick for dealing with imperfect information in 
a repeated game is to break the game up into phases and perform random sampling 
in each round to estimate the cost of each expert and update weights. The challenge 
in adapting EXBW is that there are exponentially many experts in $\mathcal{E}$. Our key idea was to estimate $\ell(p, \vec{a}, \sigma)$ for each $p \in C$ so there are only $n^{O(1/\gamma)}$ samples to take in each phase. We can then update the implicit weight representation using the estimated values $\ell(p, \vec{a}, \sigma)$.

6 Open Questions 

In this paper, we defined a new class of games called bounded-memory games, in- 
troduced several new notions of regret, and presented hardness results and algorithms 
for regret minimization in this subclass of stochastic games. Because both the games 
and the notions of regret we study in this paper rely on novel definitions, they raise a 
number of interesting open problems: (1) To what extent can the hardness results of Theorems 1 and 2 be further improved ($\gamma = 1/\log n$?) Could similar hardness results apply to games with perfect information? (2) Is there an efficient non-approximate oblivious regret minimization algorithm for bounded-memory games with perfect information? (3) Is there a $\gamma$-approximate oblivious regret minimization algorithm with running time $n^{o(1/\gamma)}$? For example, could one design a $\gamma$-approximate oblivious regret minimization algorithm with running time $n^{\log(1/\gamma)}$? (4) For repeated games ($m = 0$) is there an efficient $\gamma$-approximate k-adaptive regret minimization algorithm if we use the K-adaptive strategies as our set of experts ($K = \log n$)?

7 Hardness Reduction: Proof of Claims 

This section contains the proofs of the lemmas and theorems from Section [?].

Claim 1. Fix a polynomial $p(\cdot)$ and let $\alpha = n \cdot E_R[P(D, A_R, G, T)]$, where $T = p(n)$ and D is any polynomial time computable strategy. There is a polynomial time randomized algorithm S which satisfies an $\alpha$ fraction of the clauses from $\phi$ in expectation.

Proof. Let $p(\cdot)$ be given such that $T[D] < p(n)$ and set

$$\alpha = n \times E_R[P(D, A_R, G, T)].$$

We present S (Algorithm 1), an algorithm to recover the variable assignment. S runs in time

$$T(S) = O(p(n)^2).$$

During the simulation we present D with (potentially) false history in each stage, where the defender always thinks he hasn't satisfied the clause C. Let $Y_j$ be the expected fraction of clauses satisfied in stage j of the simulation. We define the random variable $X_j$ to be the reward D earns in stage j in the actual game. Observe that the game is structured so that two rewards during the same stage must be separated by
a penalty. When the defender receives a reward the outcome 0*~^ is produced. If 
the defender wishes to avoid an offsetting penalty then he must keep producing the 
outcome by playing = 2, preventing him from receiving an award for the rest 
of the stage. The maximum payout a defender strategy D can receive during any stage is 1 so $X_j \in \{0, 1\}$. Because of imperfect information the defender cannot learn any information about the clause the adversary has selected. We have

$$E[X_j] = \Pr[X_j = 1] = E[Y_j].$$

In particular, since T rounds consist of $T/n$ stages,

$$\alpha = n \cdot E_R[P(D, A_R, G, T)] = \frac{n}{T} \sum_{j=1}^{T/n} E[X_j] = \frac{n}{T} \sum_{j=1}^{T/n} E[Y_j],$$

so there exists a stage j such that $E[Y_j] \ge \alpha$. Let Y denote the fraction of clauses satisfied by S, then

$$Y = \max_j Y_j,$$

so we have

$$E[Y] \ge \alpha.$$

□
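The bookkeeping behind $Y = \max_j Y_j$ is easy to make concrete: S keeps the best assignment seen so far, judged by the fraction of clauses it satisfies. The sketch below is illustrative only. The clause encoding (pairs of variable index and required truth value) and the function names are assumptions, and the candidate assignments stand in for whatever the simulation of D produces in each stage.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Literal = Tuple[int, bool]      # (variable index, required truth value)
Clause = Tuple[Literal, ...]


def satisfied_fraction(clauses: List[Clause], assignment: Dict[int, bool]) -> float:
    """Fraction of clauses with at least one literal made true by the assignment."""
    satisfied = sum(
        any(assignment.get(var) == value for var, value in clause)
        for clause in clauses
    )
    return satisfied / len(clauses)


def best_assignment(
    clauses: List[Clause], candidates: Iterable[Dict[int, bool]]
) -> Tuple[Optional[Dict[int, bool]], float]:
    """Keep the candidate with the largest satisfied fraction (Y = max_j Y_j)."""
    best, best_fraction = None, -1.0
    for assignment in candidates:
        fraction = satisfied_fraction(clauses, assignment)
        if fraction > best_fraction:
            best, best_fraction = dict(assignment), fraction
    return best, best_fraction
```

An unassigned variable never satisfies a literal here, which is a conservative convention chosen for the sketch.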



Claim 2. Suppose that there is a variable assignment that satisfies $(1 - \beta) \cdot \ell$ of the clauses in $\phi$. Then there is a fixed strategy f such that $E_R[P(f, A_R, G, n)] \ge (1 - \beta)/n$, where R is used to denote the random coin tosses of the oblivious adversary.

Proof. Let $x_1^*, \ldots, x_{n-1}^*$ be the assignment that satisfies at least a $(1 - \beta)$ fraction of the clauses and let $s_0, \ldots, s_{n-1}$ be the De Bruijn sequence played by the adversary. $x_n$ is an additional variable that is not in any of the clauses. Then on round t we have

^ — ^(-^i— 1 mod ■■■7 ^i—m mod n)? O ^ , 

where $i = t \bmod n$ so both these states are associated with the variable $x_i$. For $0 < i < n$ we set

f ii'^i— 1 mod m ^2— m mod n) 5 0) * * 

To avoid taking a penalty we set 

/ {{^i—l mod •*•) ^i—m mod n); 1) ^ ; 
for $0 < i < n$. For $i = 0$ we set



f (('^4—1 mod m •■■7 '^i — m mod n); 1) ; 

to produce the required outcome (recall that the adversary will play $a^t = (s_0, 3)$ whenever $t \equiv 0 \bmod n$ so we can avoid the penalty). The fixed strategy f will receive reward 1 in stage j if and only if $x_1^*, \ldots, x_{n-1}^*$ satisfies the clause $C_j$ chosen in stage j. Hence

$$E_R[P(f, A_R, G, n)] \ge \frac{1 - \beta}{n}. \qquad (1)$$

□ 

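Claim 3. Suppose that D is a $\left(\frac{1}{8n} - \frac{3\beta}{n}\right)$-approximate oblivious regret minimization algorithm against the class of oblivious adversaries and there is a variable assignment that satisfies a $(1 - \beta)$ fraction of the clauses in $\phi$. Then for $T = \mathrm{poly}(n)$

$$E_R[P(D, A_R, G, T)] \ge \frac{7}{8n} + \frac{\beta}{n},$$

where R is used to denote the random coin tosses of the oblivious adversary.

Proof. By Claim 2 there is a fixed strategy f with

$$E_R[P(f, A_R, G, T)] \ge \frac{1 - \beta}{n}.$$

Set $\epsilon = \beta/n$, and apply Definition 1 to get

$$P(f, A_R, G, T) - P(D, A_R, G, T) < \left(\frac{1}{8n} - \frac{3\beta}{n}\right) + \frac{\beta}{n},$$

for any random string R (adversary coin flips). This means that

$$E_R[P(f, A_R, G, T)] - E_R[P(D, A_R, G, T)] < \left(\frac{1}{8n} - \frac{3\beta}{n}\right) + \frac{\beta}{n}.$$

Rearranging terms

$$E_R[P(D, A_R, G, T)] \ge \frac{1 - \beta}{n} - \left(\frac{1}{8n} - \frac{3\beta}{n}\right) - \frac{\beta}{n} = \frac{7}{8n} + \frac{\beta}{n}.$$

□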



Before we prove Theorem 1 we will first prove an easier Lemma using these claims. The proof of Lemma 1 can be easily adapted to prove Theorems 1 and 2. Details can be found in the appendix.

Lemma 1. Unless NP = RP, for $\gamma < 1/(8n)$ there is no efficient $\gamma$-approximate oblivious regret minimization algorithm which uses the fixed strategies F as experts against oblivious adversaries for bounded-memory-m games of imperfect information.

Proof of Lemma 1. Suppose that D were an efficient $\gamma$-approximate oblivious regret minimization algorithm and consider the polynomial time randomized algorithm S. Combining Claim 3 and Claim 1, for every MAX3SAT formula $\phi$ with a $\ge (1 - \beta)$ fraction of the clauses satisfiable, S satisfies a $\ge \frac{7}{8} + \beta$ fraction of the clauses from $\phi$ in expectation. This would imply that NP = RP [?]. □

The proof of Theorem 1 is very similar to the proof of Lemma 1.

Reminder of Theorem 1. For any $\beta > 0$ and $\gamma < 1/(8n^{\beta})$ there is no efficient $\gamma$-approximate oblivious regret minimization algorithm which uses the fixed strategies F as experts against oblivious adversaries for the class of imperfect information bounded-memory-m games unless NP = RP.

Proof of Theorem 1. The key point is that if an algorithm S runs in time $O(p(n))$ on instances of size $n^{\beta}$ for some polynomial $p(n)$ then on instances of size n, S runs in time $O(p(n^{1/\beta}))$ which is still polynomial time. Unless NP = RP, for all $\epsilon, \beta > 0$ and every algorithm S running in time poly(n), there exists an integer n and a MAX3SAT formula $\phi$ with $n^{\beta}$ variables such that

1. There is an assignment satisfying at least $(1 - \epsilon)$ of the clauses in $\phi$.

2. The expected fraction of clauses in $\phi$ satisfied by S is $\le \frac{7}{8} + \epsilon$.

If we reduce from a MAX3SAT instance with $n^{\beta}$ variables we can construct a game with $O(n)$ states ($n^{1-\beta}$ copies of each variable). One Hamiltonian cycle would now correspond to $n^{1-\beta}$ phases of the game. This means that the expected average reward of the optimal fixed strategy is at least

$$\max_{f \in F} E_R[P(f, A_R, G, T)] \ge \frac{1 - \epsilon}{n^{\beta}},$$

while the expected average reward of an efficient defender strategy D is at most

$$E_R[P(D, A_R, G, T)] \le \frac{\frac{7}{8} + \epsilon}{n^{\beta}}.$$

Therefore, the expected average regret is at least

$$R_0(D, A_R, G, T, F) \ge \frac{\frac{1}{8} - 2\epsilon}{n^{\beta}}.$$

□

While the proof of Theorem 2 makes use of the randomized exponential time hypothesis the argument is similar to the proof of Theorem 1.

Reminder of Theorem 2. Assume that the randomized exponential time hypothesis is true. Then for any $\gamma < 1/(8 \log^2 n)$ there is no efficient $\gamma$-approximate oblivious regret minimization algorithm which uses the fixed strategies F as experts against oblivious adversaries for the class of imperfect information bounded-memory-m games.
Proof of Theorem 2 (sketch). Assume that the randomized exponential time hypothesis holds. Then because it is NP-hard to approximate MAX3SAT within any factor better than $\frac{7}{8}$ [?], no randomized algorithm which satisfies $\ge \frac{7}{8} + \epsilon$ of the clauses in a MAX3SAT instance in expectation can run in time

$$2^{o(n)}.$$

Now we argue that it is sufficient to reduce from a MAX3SAT instance with $n' = \log^2 n$ variables (instead of n variables). One Hamiltonian cycle now corresponds to

$$\frac{n}{\log^2 n}$$

phases of the game. If our bounded-memory game G has n states then any efficient $\gamma$-approximate regret minimization algorithm S must run in time $O(n^k)$ for some constant k. If the randomized exponential time hypothesis holds then the expected average reward of an efficient defender strategy D is at most

$$E_R[P(D, A_R, G, T)] \le \frac{\frac{7}{8} + \epsilon}{\log^2 n},$$

since

$$n^k = 2^{k \log n} = 2^{o(\log^2 n)} = 2^{o(n')}.$$

However, if the MAX3SAT formula was satisfiable then the expected average reward of the optimal fixed strategy is at least

$$\max_{f \in F} E_R[P(f, A_R, G, T)] \ge \frac{1 - \epsilon}{\log^2 n}.$$

Therefore, the expected average regret is at least

$$\frac{\frac{1}{8} - 2\epsilon}{\log^2 n}.$$

Assume for contradiction that $\gamma < \frac{1}{8 \log^2 n}$; then S can be adapted to satisfy $\ge \frac{7}{8} + \epsilon$ of the clauses in MAX3SAT with running time

$$O(n^k) = 2^{o(n')}.$$

This contradicts the randomized exponential time hypothesis.

□

Remark 1 explains how our hardness reduction can be adapted to prove that there is no efficient k-adaptive regret minimization algorithm in the perfect information setting for $k \ge 1$.

Remark 1. In bounded-memory games of perfect information we can replace the oblivious adversary $A_R$ in Figure [?] with a 1-adaptive adversary and essentially the same reduction will still work. We only need to make a few small modifications. The states of the game will be modified to store the defender's last action. The adversary again plays a Hamiltonian cycle through the states in each phase. Now the first two states we visit correspond to the variable $x_1$, the next two visited states will correspond to $x_2$, etc. If the defender plays actions 1 and 1 (resp. 0 and 0) while visiting the variable $x_i$ then this corresponds to assigning $x_i$ to true (resp. false). If the defender plays 1 and 0 (or 0 and 1), which corresponds to no assignment, then the adversary strategy will ensure that he cannot receive a reward.

The 1-adaptive adversary will always play $a^t[2] = 2$ on even rounds ($t \equiv 0 \bmod 2$) and on odd rounds the adversary will adaptively select $a^t[2] = d^{t-1}$ if the defender's last action satisfied the chosen clause C, otherwise $a^t[2] = 2$. The defender receives a reward only if (1) he plays a consistent assignment during both rounds, (2) the assignment satisfies the chosen clause C and (3) he has not already received a reward during this phase. Now Claim 1 still holds because a defender will always observe the adversary action $a^t[2] = 2$ until he has satisfied the clause C.

Figure 2: De Bruijn example

7.1 Transition Example 

By playing a De Bruijn sequence $S = s_1 \ldots s_n$ the adversary can guarantee that we repeatedly take a Hamiltonian cycle over states. For example, considering 8 states and starting from $x_0$, the sequence 10111000 corresponds to the Hamiltonian cycle

$$x_0, x_1, x_2, x_5, x_3, x_7, x_6, x_4.$$
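The example can be reproduced with a short shift-register walk. The sketch assumes that state $x_j$ is the integer j whose binary expansion holds the last $m = 3$ bits played, with the newest bit in the lowest position; this convention is an assumption chosen because it matches the cycle listed above, and the function name `walk` is hypothetical.

```python
def walk(bits: str, m: int, start: int = 0):
    """Return the states visited when the bits are shifted into an m-bit window."""
    mask = (1 << m) - 1
    state = start
    visited = [state]
    for b in bits:
        # Shift the window left, append the new bit, and drop the oldest bit.
        state = ((state << 1) | int(b)) & mask
        visited.append(state)
    return visited


path = walk("10111000", 3)
```

Under this convention the walk visits every one of the 8 states exactly once and then returns to the start state, so repeating the sequence repeats the Hamiltonian cycle.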

8 Regret Minimization Algorithms 

8.1 Regret Minimization Algorithm with Imperfect Information 

We present BW (Bounded Memory Weighted Majority), an algorithm that minimizes k-adaptive regret for bounded-memory games. This result is significant because there is no k-adaptive regret minimization algorithm for the general class of stochastic games (see Theorem 6 in the appendix). A consequence of Theorem 1 is that when the expert set includes all fixed strategies F we cannot hope for an efficient algorithm unless NP = RP. Indeed, our algorithm would not be efficient in this case because it would have to explicitly maintain weights for exponentially many fixed strategies, $|F| = |X_D|^n$.

The key idea behind our k-adaptive regret minimization algorithm BW is to reduce the original bounded-memory game to a repeated game $\rho(G, K)$ of imperfect information ($K \equiv 0 \bmod k$). BW uses the Exp3 regret minimization algorithm of [?] for repeated games of imperfect information. In particular, BW uses the strategies selected by Exp3 in each round of $\rho(G, K)$ to play the next K rounds of G. BW feeds Exp3 the hypothetical losses from $\rho(G, K)$ to update the weights of each expert.
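The reduction can be sketched as an Exp3 learner that is consulted once per block of K rounds. This is a hedged illustration rather than the paper's BW: it uses the textbook reward form of Exp3 with a fixed exploration rate `gamma`, it assumes per-round payoffs lie in [0, 1] so that a block payoff divided by K lies in [0, 1], and `block_reward` is a hypothetical callback standing in for playing an expert for K rounds of G.

```python
import math
import random


class Exp3:
    """Textbook Exp3 for rewards in [0, 1] with bandit feedback."""

    def __init__(self, n_experts, gamma, rng):
        self.n = n_experts
        self.gamma = gamma
        self.rng = rng
        self.weights = [1.0] * n_experts

    def probabilities(self):
        total = sum(self.weights)
        return [
            (1 - self.gamma) * w / total + self.gamma / self.n
            for w in self.weights
        ]

    def draw(self):
        return self.rng.choices(range(self.n), weights=self.probabilities())[0]

    def update(self, chosen, reward):
        # Only the chosen expert's reward is observed; importance-weight it.
        p = self.probabilities()[chosen]
        self.weights[chosen] *= math.exp(self.gamma * (reward / p) / self.n)


def play_blocks(block_reward, n_experts, K, n_blocks, gamma=0.1, seed=0):
    """Run Exp3 over n_blocks rounds of the repeated game, each K rounds long."""
    rng = random.Random(seed)
    learner = Exp3(n_experts, gamma, rng)
    for _ in range(n_blocks):
        expert = learner.draw()
        total = block_reward(expert, K)    # summed payoff over K rounds, in [0, K]
        learner.update(expert, total / K)  # normalize to [0, 1]
    return learner.probabilities()
```

Because every expert keeps probability at least `gamma / n_experts`, the importance-weighted estimates stay bounded, which is what allows the update without observing the other experts' payoffs.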

Reminder of Theorem 3. Let G be any bounded-memory-m game with n states and let A be any adversary strategy. After playing T rounds of G against A, BW(G, K) achieves regret bound

$$R_k(\mathrm{BW}, A, G, T, S) < \frac{m}{T^{1/4}} + 4\sqrt{\frac{N \log N}{T^{1/2}}},$$

where $N = |S|$ is the number of experts, A is the adversary strategy and K has been chosen so that $K = T^{1/4}$ and $K \equiv 0 \bmod k$.

Proof of Theorem 3 (Sketch). The proof of the theorem uses the standard regret bound for regret minimization algorithms in games of perfect information [?]. After playing T rounds ($T/K$ rounds of $\rho(G, K)$) we have

$$P(D, A_k, \rho(G, K), T/K) - P(f, A_k, \rho(G, K), T/K) \ge -4\sqrt{\frac{K N \log N}{T/K}}$$

for all fixed strategies $f \in F$. Here, N is the number of experts

$$N = |F| = |X_D|^n,$$

and K also denotes the maximum payout in any round of $\rho(G, K)$. Because K was chosen such that $K \equiv 0 \bmod k$ the adversary $A_k$ is always in phase with $\rho(G, K)$ and we can apply Claim 4 to get Theorem 3. □

In particular, BW is a k-adaptive regret minimization algorithm for the class of bounded-memory games in the sense of Definition 1 because the regret bound of Theorem 3 tends to 0 as $T \to \infty$.

Remark 2. BW is inefficient when the number of experts in S is exponential in n, the number of states in G. For example, if $S = F$ then $|F| = |X_D|^n$. For small values of n (example: for repeated games $n = 1$) it will still be tractable to run BW with $S = F$.



8.2 Proofs of Claims and Theorems 

This section contains the proofs of claims and theorems from Section 5.

Claim 4 bounds the difference between the hypothetical losses from $\rho(G, K)$ and actual losses in G using the bounded-memory property.

Claim 4. For any adaptive defender strategy f, any adaptive adversary strategy g, and any state $\sigma$ of G we have $|P(f, g, G, \sigma, K) - P(f, g, G, \sigma_0, K)| \le m$.

Proof of Claim 4 (Sketch). Once the defender selects f and the adversary selects strategy $g \in K\text{-ADAPT}_A$, the actions of the adversary and the defender are fixed for the next K rounds of G. Let $d^1, \ldots, d^K$ (resp. $a^1, \ldots, a^K$) denote the actions taken by the defender (resp. adversary). Once $R_1, \ldots, R_K$ (the random coins used by the outcome function) are fixed then the outcomes $O^1, \ldots, O^K$ are also fixed. Let $\sigma^1, \ldots, \sigma^K$ denote the states encountered in the actual game and let $\tilde{\sigma}^1, \ldots, \tilde{\sigma}^K$ be the states that we would have encountered if we had started at $\sigma_0$ as in $\rho(G, K)$. In a bounded-memory game the state encodes the last m outcomes, but the outcomes do not depend on the starting state so we have

$$\sigma^j = \tilde{\sigma}^j$$

for all $j \ge m$. This means that for $j \ge m$

$$P(d^j, a^j, \sigma^j) = P(d^j, a^j, \tilde{\sigma}^j).$$

Consequently,

$$|P(f, g, \sigma, G) - P(f, g, \sigma_0, G)| = \left| \sum_{t=1}^{K} \left( P(d^t, a^t, \sigma^t) - P(d^t, a^t, \tilde{\sigma}^t) \right) \right| = \left| \sum_{t=1}^{m-1} \left( P(d^t, a^t, \sigma^t) - P(d^t, a^t, \tilde{\sigma}^t) \right) \right| \le m.$$
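The bounded-memory property behind Claim 4 can be seen in a small simulation: two runs that start in different states but observe the same outcomes have identical states once the initial contents have been pushed out of the window. The sketch below assumes a convention that is not spelled out in the text, namely that the state at the start of a round is the tuple of the last m outcomes; the function names are hypothetical.

```python
from collections import deque


def start_states(initial, outcomes, m):
    """State at the start of each round: a sliding window of the last m outcomes."""
    window = deque(initial, maxlen=m)
    seq = []
    for o in outcomes:
        seq.append(tuple(window))  # state seen when this round is played
        window.append(o)           # the round's outcome enters the window
    return seq


def differing_rounds(initial_a, initial_b, outcomes, m):
    """Number of rounds in which the two runs are in different states."""
    states_a = start_states(initial_a, outcomes, m)
    states_b = start_states(initial_b, outcomes, m)
    return sum(x != y for x, y in zip(states_a, states_b))
```

With payoffs in [0, 1], the two runs can therefore differ in total payoff by at most the number of differing rounds, which never exceeds m under this convention.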



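The standard weighted majority algorithm maintains the invariant that $W_E = \beta^{\sum_{j=1}^{T/K} P(E, \vec{a}^j, \rho(G, K))}$. Claim 5 says that EXBW also maintains this invariant.

Claim 5. For each expert $E \in \mathcal{E}$, $\prod_{p \in C(E)} w_p = \beta^{\sum_{j=1}^{T/K} P(E, \vec{a}^j, \rho(G, K))}$.

Proof of Claim 5. First notice that we can write

$$\sum_{j=1}^{T/K} P(E, \vec{a}^j, \rho(G, K)) = \sum_{p \in C(E)} \sum_{j=1}^{T/K} \ell(p, \vec{a}^j, \sigma^j),$$

since the overall payoff of an expert E can be expressed as a sum of the individual immediate payoffs after each action. Consequently,

$$W_E = \prod_{p \in C(E)} w_p = \prod_{p \in C(E)} \beta^{\sum_{j=1}^{T/K} \ell(p, \vec{a}^j, \sigma^j)} = \beta^{\sum_{j=1}^{T/K} P(E, \vec{a}^j, \rho(G, K))}.$$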

□

Claim 6 says that Sample$(\mathcal{E})$ samples from the right distribution.

Claim 6. For each expert $E \in \mathcal{E}$ Algorithm Sample$(\mathcal{E})$ outputs E with probability

$$\Pr[E] \propto W_E.$$

Proof. Given a trace $p = p_0; O; d$ let Chosen$(p_0; O)$ be the event that the strategy output by Algorithm Sample$(\mathcal{E})$ plays d from given history $p_0; O$.

$$\begin{aligned}
\Pr[\text{Output } E] &= \prod_{p \in C,\, O \in \mathcal{O}} \Pr[\mathrm{Chosen}(p; O) = E(p; O)] \\
&= \prod_{p \in C,\, O \in \mathcal{O},\, d = E(p, O)} \frac{\hat{w}_{p;O;d}}{\sum_{d' \in X_D} \hat{w}_{p;O;d'}} \\
&= \prod_{p \in C,\, O \in \mathcal{O},\, d = E(p, O)} \frac{\sum_{E' : (p;O;d) \in C(E')} \prod_{p' \in C(E') \wedge p;O;d \sqsubseteq p'} w_{p'}}{\sum_{d' \in X_D} \sum_{E' : (p;O;d') \in C(E')} \prod_{p' \in C(E') \wedge p;O;d' \sqsubseteq p'} w_{p'}} \\
&= \prod_{p \in C,\, O \in \mathcal{O},\, d = E(p, O)} \frac{\sum_{E' : (p;O;d) \in C(E')} \prod_{p' \in C(E')} w_{p'}}{\sum_{d' \in X_D} \sum_{E' : (p;O;d') \in C(E')} \prod_{p' \in C(E')} w_{p'}} \\
&= \prod_{p \in C,\, O \in \mathcal{O},\, d = E(p, O)} \frac{\sum_{E' : (p;O;d) \in C(E')} W_{E'}}{\sum_{d' \in X_D} \sum_{E' : (p;O;d') \in C(E')} W_{E'}} \\
&= \frac{W_E}{\sum_{E' \in \mathcal{E}} W_{E'}}.
\end{aligned}$$

□

Reminder of Theorem 4. Let G be any bounded-memory-m game of perfect information with n states and let A be any adversary strategy. Playing T rounds of G against A, EXBW runs in total time $T n^{O(1/\gamma)}$ and achieves regret bound

$$R_0(\mathrm{EXBW}, A, G, T, \mathcal{E}) < \gamma + O\left(\frac{m}{\gamma}\sqrt{\frac{n \log N}{T}}\right),$$

where K has been set to $m/\gamma$ and $N = |A_K|$ is the number of K-adaptive strategies.

Proof of Theorem 4. By Claims 5 and 6, Algorithm EXBW perfectly simulates the weighted majority algorithm [?]. Notice that there are $N^n$ experts in $\mathcal{E}$ and we are playing $T/K$ rounds of $\rho(G, K)$. The maximum payment in a round of $\rho(G, K)$ is $K = m/\gamma$. The regret bound immediately follows from Claim 4 (the $\gamma = m/K$ term) and the standard regret bound from [?] after setting



. 1 / nln(jV) ^ 
/3 = mm{-,y }. 

The regret bound holds against all experts $E \in \mathcal{E}$ so in particular the regret bound also holds against all fixed experts $f \in F$ since $F \subset \mathcal{E}$.

The running time of EXBW is proportional to the number of traces in C. There are only $n^{O(1/\gamma)}$ total traces in C so for any constant $\gamma$ the running time is polynomial. □

Reminder of Theorem 5. Let G be any bounded-memory-m game with n states and let A be any adversary strategy. After playing T rounds of G against A, BW(G, K) achieves regret bound

$$R_k(\mathrm{BW}, A, G, T, S) < \frac{m}{T^{1/4}} + 4\sqrt{\frac{N \log N}{T^{1/2}}},$$

where $N = |S|$ is the number of experts, A is the adversary strategy and K has been chosen so that $K = T^{1/4}$ and $K \equiv 0 \bmod k$.

Proof of Theorem 5 (Sketch) We group the rounds of p (G, K) into phases of - 
rounds. Each phase now corresponds to 



1/t 

7 



K- 



9 ' 

7 7 

rounds of G. As before there are $N^n$ experts.

Within a single phase let {i — 1, 77,^/^/7) denote the actions of the adversary 
during round i of that phase. To update our implicit weight representation we would 
like to compute 



for each $p \in C$. However, we do not know the adversary actions $a^i$ in each phase. Instead of computing

i 

we will estimate this quantity. For each 

we will play the defender actions d in a randomly chosen round of the phase. Let O and $\ell = (\ell_1, \ldots, \ell_{m/\gamma})$ denote the observed outcomes and payoffs in this round and let $p^j$ be the path corresponding to the first j defender actions from d and outcomes from O. For each path $p^j$ we set



7 



If the path p never occurred during a sampling round of the phase then we set

$$\ell'(p, \vec{a}) = 0.$$



For each path p e C we have 



,1/7 



77 

7 



where the expectation is taken over the random selection of sampling rounds. Now we can use the estimated losses $\ell'$ to maintain our implicit weight representation.

The following factors explain why the final regret bound is slightly worse than the bound in the perfect information setting (Theorem 4):



1 . We spend at most 



1 /f 

rounds of each phase sampling. There are rounds in a phase so the average 
sampling loss per round is at most 

nV7 



m 

This is in addition to the modeling loss ($\gamma$) from Claim 4. In the perfect information setting there is no sampling loss, just the modeling loss.

2. We are now only updating weights after each phase. If T is the number of rounds of the bounded-memory game G that we play then we only update weights $T'$ times where

In the perfect information setting we had T' = ^ . 

3. The maximum loss in each phase is now the length of a phase 

7 V 7 

instead of the length of a round $m/\gamma$.

□

Remark 3. Because repeated games are a subset of bounded-memory games, EXBW (resp. EXBWII) could also be used to minimize oblivious regret in a repeated game of perfect information (resp. imperfect information) using the K-adaptive strategies as experts. In this case there is no modeling loss from Claim 4, so the guarantee is that we perform as well as the best K-adaptive defender strategy in hindsight. As long as $K = O(\log n)$ the running time of our algorithms will be polynomial in n.

8.3 Impossibility of Regret Minimization in Stochastic Games 

Stochastic Games. Stochastic games are a generalization of repeated games, in which the payoffs depend on the state of play. Formally, a two-player stochastic game between an attacker A and a defender D is given by $(X_D, X_A, \Sigma, P, \tau)$, where $X_A$ and $X_D$ are the action spaces for players A and D, respectively, $\Sigma$ is the state space, $P : \Sigma \times X_D \times X_A \to [0, 1]$ is the payoff function and $\tau : \Sigma \times X_D \times X_A \times \{0, 1\}^* \to \Sigma$ is the randomized transition function linking the different states.

Thus, the payoff during round t depends on the current state (denoted $\sigma^t$) in addition to the actions of the defender ($d^t$) and the adversary ($a^t$). This added flexibility enables us to develop realistic game models for interactions where the rewards depend on game history. The hospital-employee interaction we introduced earlier is one example of such an interaction: an employee committing a given violation for the first time is unlikely to meet the same punishment as an employee committing the same violation for the tenth time.

A fixed strategy for the defender in a stochastic game is a function $f : \Sigma \to X_D$ mapping each state to a fixed action. F denotes the set of all fixed strategies.

In this section we demonstrate that there is no regret minimization algorithm for the general class of stochastic games. More specifically, for every notion of regret (oblivious ($k = 0$), k-adaptive, fully adaptive ($k = \infty$)) there is no k-adaptive regret minimization algorithm for the class of stochastic games. It suffices to consider oblivious regret against an oblivious adversary (see Remark 4). The example in Theorem 6 is fundamentally similar to example IV.1 of [?].

Theorem 6. There is a stochastic game G such that for any defender strategy D there exists an oblivious adversary A such that

$$\lim_{T \to \infty} R_k(D, A, G, T) > 0.$$

Proof. In particular, consider the stochastic game G illustrated in Figure 3. The figure shows a game with two players D and A with action sets $X_D = \{d_1, d_2\}$ and $X_A = \{a_1, a_2\}$ respectively. The reward function for the defender depends only on his own action as well as the current state $\sigma$. Observe that $\sigma_2$ is a sink state which the game can never leave. If the game reaches this state then the defender will be continuously rewarded in every round for the rest of the game. However, the only way to reach $\sigma_2$ is if the defender and the adversary play $(d_1, a_1)$ simultaneously in some round t. If the defender fails to play $d_1$ then he might permanently miss his opportunity to reach $\sigma_2$. This suggests that the defender must always play $d_1$. However, if the adversary never plays $a_1$ then it is best to use the fixed strategy that always plays $d_2$. □

Notice that for any oblivious adversary A and any defender strategy D we have

$$P(D, A, G, T) = P(D, A_0, G, T),$$

because $A = A_0$. Hence, $R_0 = R_k$ whenever the adversary is oblivious.

Remark 4. 1. If D can minimize k-adaptive regret against any k-adaptive adver- 
sary then D can minimize k-adaptive regret against any oblivious adversary 
(k = 0) because 

2. If D can minimize k-adaptive regret against any k-adaptive adversary then D can 
minimize k-adaptive regret against any oblivious adversary because Rq = Rk 
whenever the adversary is oblivious. 

3. If D is a k-regret minimization algorithm for a class of games $\mathcal{G}$ and $\mathcal{G}'$ is a subclass of $\mathcal{G}$ then D is also a k-regret minimization algorithm for the class of games $\mathcal{G}'$.




Edge label: $d_1, *$

$P(d_1, \sigma_1) = -1$, $P(d_1, \sigma_2) = 1$
$P(d_2, \sigma_1) = 0$, $P(d_2, \sigma_2) = 1$

Figure 3: A counterexample to prove Theorem 6

This example also illustrates why it is impossible to minimize fully adaptive regret 
against a non-forgetful adversary. In particular a non-forgetful adversary could use the states from Figure 3 to decide whether or not to cooperate. Note that even if the adversary can only see the last m outcomes (sliding window) the adversary could play to
remind himself of events arbitrarily long ago. For example, an adversary who wanted 
to remember whether or not the defender played action d during round 1 might play a 
special reminder action every m rounds when the latest reminder is about to go out of 
memory.

Algorithm 1 Assignment Recovery

• Input: D

• Input: MAX3SAT instance $\phi$, with variables $x_1, \ldots, x_{n-1}$ and clauses $C_1, \ldots, C_\ell$

• De Bruijn sequence: $s_0, \ldots, s_{n-1}$

• Initialize: Set i <«- 0, i? 0, T ^ p{n), a* 

• Round t: Set $i \leftarrow t \bmod n$

  1. Check 1: If $t > T$ then return.

  2. Check 2: If our current assignment $x_1, \ldots, x_{n-1}$ satisfies a y fraction of the clauses where $y > \alpha^*$ then set

     $x_i^* \leftarrow x_i$,

     and

     $\alpha^* \leftarrow y$.

  3. Select Clause: If $i = 0$ then select a new clause C uniformly at random from $C_1, \ldots, C_\ell$, and set $H' = \emptyset$.

4. Select Adversary Move: 

!(si,3) ifi = 0; 
if Xi€C; 
(s„0) ifXiSC; 
(s,,2) otherwise. 

5. Select Defender Move: 

  6. Update: Let $O^t$ be the outcome and set

     $H' \leftarrow H' + (s_i, O^t)$,
     $x_i \leftarrow d^t$.